knitr::opts_chunk$set(echo = TRUE)

# pacman itself must be installed first (install.packages("pacman"))
pacman::p_load(tidyverse, gridExtra, mapview, tigris, sf, broom, GGally, caTools)

airbnb <- read.csv("V_airbnb_Boston_part2.csv",
                   # properly format NAs
                   na.strings=c("","NA", "N/A"))
# Filter to only airbnbs with at least 5 reviews to help ensure that mean ratings are fairly accurate
airbnb <- 
  airbnb %>% filter(number_of_reviews >= 5) %>% 
  # Remove unused columns for ease of use
  select(host_id, host_since, host_response_time, host_response_rate, neighbourhood_cleansed, host_listings_count, latitude, longitude, property_type, room_type, amenities, price, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value) %>%
  # variable renames
  rename(overall_score = review_scores_rating,
         accuracy_score = review_scores_accuracy,
         cleanliness_score = review_scores_cleanliness,
         checkin_score = review_scores_checkin,
         communication_score = review_scores_communication,
         location_score = review_scores_location,
         value_score = review_scores_value,
         neighborhood = neighbourhood_cleansed)

# properly format price
airbnb$price <- as.numeric(gsub('\\$|,', '', airbnb$price))

Part 1: Introduction

AirBnB, a massive online marketplace for lodging rentals, allows people to rent out their homes for short- to medium-length stays to anyone interested. For those hoping to rent out their homes as AirBnBs, an obvious question is: “What will make a listing successful?” Our data, uploaded in December 2021 by Inside AirBnB, can help answer that question. Inside AirBnB describes itself as a “mission driven project that provides data and advocacy about AirBnB’s impact on residential communities.” The data provide only a snapshot of the listings at the time of retrieval and should therefore be treated as sample data, since listings change frequently. Here is how Inside AirBnB describes its data collection: “The data utilizes public information compiled from the AirBnB web-site including the availability calendar for 365 days in the future, and the reviews for each listing. Data is verified, cleansed, analyzed and aggregated.” Neither the data nor the method of collection suggests any obvious biases to us. This data and analysis should be of particular interest to anyone considering hosting on AirBnB, or to anyone curious about the details of AirBnB hosting and ratings.

For all work below, only AirBnBs with 5 or more reviews were considered, to help ensure that average ratings are close to their true means. Ratings are of course subjective regardless, but filtering to slightly larger sample sizes reduces the noise in each listing’s average. We also renamed and reformatted several variables and removed those we did not find useful. All of these data cleaning steps, and more, are shown above.

Part 2: Data Visualizations

ggplot(data = airbnb, mapping = aes(x = overall_score)) + 
  geom_histogram(binwidth = 1, color = "black", fill = "red") + 
  xlab("Overall Score") + 
  ylab("Count") + 
  labs(title = "Histogram of Ratings") + 
  scale_y_continuous(breaks = seq(0, 250, 25)) + 
  scale_x_continuous(breaks = seq(0, 100, 5)) + 
  theme_classic() +
  theme(plot.title = element_text(hjust = .5))

# Tables showing proportion of scores at or above 90, then 95
airbnb %>% 
  filter(!is.na(overall_score)) %>%
  count(overall_score >= 90)
airbnb %>% 
  filter(!is.na(overall_score)) %>%
  count(overall_score >= 95)

The vast majority of average ratings (1415, roughly 84%) are at or above 90 (out of 100 points), and a slight majority are at or above 95. This shows that, despite the variety of locations and situations, people still leave good reviews: unless a guest had a bad experience with an AirBnB, they generally give it a high score. So although the difference between a score of 80 and a score of 90 may not sound especially significant, it is within this data set, because most of the ratings fall in such a narrow range.

ggplot(data = airbnb,
       mapping = aes(x = price, y = overall_score)) +
  
  geom_point(color = "steelblue", alpha = .5) +
  
  geom_smooth(formula = y ~ x,
              color = "skyblue3",
              method = lm, 
              se = F) +
  
  labs(title = "Price and average rating",
       x = "Price ($)",
       y = "Average rating") +
  
  theme_light() +
  
  theme(plot.title = element_text(hjust = .5))

# Simple linear regression
summary(lm(data = airbnb, overall_score~price))
## 
## Call:
## lm(formula = overall_score ~ price, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.125  -2.540   1.294   3.754   6.239 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 9.364e+01  1.847e-01 507.029  < 2e-16 ***
## price       4.035e-03  9.614e-04   4.197 2.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.227 on 1687 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01034,    Adjusted R-squared:  0.009748 
## F-statistic: 17.62 on 1 and 1687 DF,  p-value: 2.842e-05

A simple linear regression of average rating on price yields a statistically significant positive slope (p < .001). However, with an R^2 of only about 1%, price is not a particularly strong predictor of average rating.

# convert data to be used with mapView(), removing NAs
airbnb_map <- st_as_sf(airbnb %>% filter(!is.na(overall_score)), 
                       coords = c("longitude", "latitude"), 
                       crs = 4326)

# Display map
mapview(airbnb_map %>% filter(overall_score >= 80), zcol = "overall_score")
# Show only perfectly rated listings to look for pattern
mapview(airbnb_map %>% filter(overall_score == 100), zcol = "overall_score")
# Remove data.frame that won't be used again
rm(airbnb_map)

We used these maps to look for a geographic pattern in average score. The first map is filtered to listings in the 80-100 range, because the vast majority of observations fall within it and the color scale communicates the data more clearly that way. No obvious geographic pattern emerges for the 80-100 range. We investigated further by checking whether the perfectly rated listings (overall score = 100) cluster anywhere in particular. Again, no clear pattern appears.

# Average rating by neighborhood data.frame
rate_by_neigh <- airbnb %>%
  group_by(neighborhood) %>%
  filter(!is.na(overall_score)) %>%
  dplyr::summarise(mean(overall_score), n()) %>% 
  rename(overall_score = "mean(overall_score)",
         num_observations = "n()")


# Show table
rate_by_neigh
# sort by rating
rate_by_neigh$neighborhood <- fct_reorder(rate_by_neigh$neighborhood,
                                          rate_by_neigh$overall_score)

# Graph
ggplot(data = rate_by_neigh,
       mapping = aes(x = neighborhood, y = overall_score)) +
  
  geom_col(width = .5, fill = "steelblue") +
  
  labs(title = "Average Score by Neighborhood",
       x = "Neighborhood",
       y = "Average rating") +
  
  scale_y_continuous(breaks = c(0,50,90,95,100)) +
  
  theme_bw() +
  
  theme(plot.title = element_text(color = "#003366"),
        panel.grid.major.y = element_blank(),
        panel.grid.minor = element_blank()) +
  
  coord_flip()

# ANOVA to see whether difference in neighborhoods is statistically significant
summary(aov(airbnb$overall_score~airbnb$neighborhood))
##                       Df Sum Sq Mean Sq F value Pr(>F)    
## airbnb$neighborhood   24   3965   165.2   6.453 <2e-16 ***
## Residuals           1664  42603    25.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

Further exploring the geographic placement of listings, this section looks at average ratings by neighborhood. The table and graph above show some variation in average rating between neighborhoods, though all 25 specified neighborhoods nonetheless fall in the 90-100 range. To check whether these differences are statistically significant, we ran a one-way ANOVA. The test yields a significant p-value (<2e-16), suggesting that mean ratings are not equal across all neighborhoods. Post-hoc testing would be needed to determine which specific neighborhoods differ from one another.
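One standard post-hoc option is Tukey's HSD applied to the same one-way ANOVA. The sketch below is not part of our original analysis; it simply shows how the pairwise comparisons could be pulled out.

```r
# Tukey's HSD: all pairwise neighborhood comparisons with adjusted p-values
neigh_aov <- aov(overall_score ~ neighborhood, data = airbnb)
tukey_res <- as.data.frame(TukeyHSD(neigh_aov)$neighborhood)

# Inspect only the pairs whose adjusted p-value falls below .05
head(subset(tukey_res, `p adj` < .05))
```

Because there are 25 neighborhoods, this produces 300 pairwise comparisons, so filtering on the adjusted p-value keeps the output readable.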

ggplot(data = airbnb, 
       mapping = aes(x = host_listings_count, y = overall_score)) +
  
  geom_point(color = "darkcyan", shape = "diamond", size = 2, alpha = .6) +
  
  geom_smooth(formula = y ~ x, method = lm, 
              se = F, color = "black") +
  
  labs(title = "Rating vs. number of listings host has",
       x = "Number of listings host has",
       y = "Average rating") +
  
  theme_light()

# Simple linear model
summary(lm(airbnb$overall_score~airbnb$host_listings_count))
## 
## Call:
## lm(formula = airbnb$overall_score ~ airbnb$host_listings_count)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.424  -2.424   1.546   3.546  12.710 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                94.484214   0.133968 705.276  < 2e-16 ***
## airbnb$host_listings_count -0.015115   0.002389  -6.327 3.19e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.193 on 1687 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02318,    Adjusted R-squared:  0.0226 
## F-statistic: 40.03 on 1 and 1687 DF,  p-value: 3.189e-10

A scatter plot and simple linear regression show a statistically significant negative linear relationship between an AirBnB’s average rating and the total number of listings its host has. However, we once again have a small R^2 of only about 2%. Other than price, this was the only numeric predictor we agreed was worth considering, so we were disappointed to once again find only a weak correlation with rating. In the next section we begin to include categorical variables in the regression model.

Part 3: Machine Learning Methods

As stated in our project proposal, the goal of this section is to build a regression model that predicts average rating as accurately as possible. To begin, we reformatted some of the categorical data to make it more usable for regression modeling. One of the first variables that caught our attention was “amenities”, which stores a list of every amenity a given AirBnB is listed as having. Because many amenities almost surely have trivial or no effect on ratings, such as a place having clothes hangers, we chose to focus on just seven that we thought might have the largest impact. The code chunk below adds a column to the data.frame for each of the amenities we chose.

# Assign new column for each of the seven amenities we selected
airbnb <- airbnb %>%
  mutate(Breakfast = FALSE,
         Patio = FALSE,
         Pool = FALSE,
         Parking = FALSE,
         Gym = FALSE,
         Washer = FALSE,
         Air_Conditioning = FALSE,
         id = row_number())

# Loop to check for each amenity
# Loop to check for each amenity
for (val_index in seq_len(nrow(airbnb))) {
  # format the amenities string into a 'clean' character vector
  val <- airbnb[val_index, "amenities"]
  val <- gsub('\\[|\\]|"', "", val)
  val <- unlist(strsplit(val, ", "))
  
  # Check for each amenity (exact match against the cleaned vector)
  if ("Breakfast" %in% val)                airbnb[val_index, "Breakfast"] <- TRUE
  if ("Patio or balcony" %in% val)         airbnb[val_index, "Patio"] <- TRUE
  if ("Pool" %in% val)                     airbnb[val_index, "Pool"] <- TRUE
  if ("Free parking on premises" %in% val) airbnb[val_index, "Parking"] <- TRUE
  if ("Gym" %in% val)                      airbnb[val_index, "Gym"] <- TRUE
  if ("Washer" %in% val)                   airbnb[val_index, "Washer"] <- TRUE
  if ("Air conditioning" %in% val)         airbnb[val_index, "Air_Conditioning"] <- TRUE
}
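The row-wise loop above can also be expressed in vectorized form, which is considerably faster on larger data. The sketch below uses the same exact-match logic and is shown for two of the seven amenities.

```r
# Vectorized alternative: split each amenities string once, then
# exact-match each amenity name within the resulting list of vectors
amenity_lists <- strsplit(gsub('\\[|\\]|"', "", airbnb$amenities), ", ")

airbnb$Pool      <- vapply(amenity_lists, function(a) "Pool" %in% a, logical(1))
airbnb$Breakfast <- vapply(amenity_lists, function(a) "Breakfast" %in% a, logical(1))
```

Exact matching (`%in%`) rather than substring matching (`grepl`) matters here: a naive `grepl("Pool", ...)` would also flag amenities like "Private pool access".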
# Correlation matrix
ggcorr(data = airbnb %>% dplyr::select(overall_score,
                                       host_listings_count,
                                       price),
       low = "red",
       mid = "grey90",
       high = "blue",
       label = T,
       label_round = 2,
       label_color = "black") 

To begin building a regression model, we created the correlation matrix above. Consistent with the scatterplots and simple linear models in Part 2, the matrix shows only weak correlations among the numeric variables we are considering.

# lm with all the variables we're looking at to start
airbnb_lm <- lm(overall_score ~ host_listings_count + price + Breakfast + Pool + Patio + Parking + Gym + Washer + Air_Conditioning + neighborhood,
                data = airbnb)

lm_table <- 
  tidy(airbnb_lm) %>% 
  mutate(across(.cols = where(is.numeric),
                .fns = ~ round(.x, digits = 3)))

lm_table
# Fit stats
fit_stats <- 
  glance(airbnb_lm) %>% 
  select(r.squared:sigma) %>% 
  round(digits = 3)

fit_stats

Multiple linear regression proved fairly ineffective at predicting average ratings. Even including every variable we were considering, regardless of its p-value, and looking only at the unadjusted R^2, we reached a coefficient of determination of only about 13%. The model could be slightly improved by removing variables whose contributions are not statistically significant, but we agreed this would yield minimal improvement to an already weak model.
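The variable pruning just mentioned could be automated with stepwise selection, e.g. backward elimination by AIC via `stats::step()`. This is a sketch, not part of our original analysis; the model is refit on complete cases because `step()` requires the sample size to stay constant as terms are dropped.

```r
# Backward elimination by AIC: step() repeatedly drops the term whose
# removal most lowers AIC, stopping when no drop helps
airbnb_complete <- airbnb %>% filter(!is.na(overall_score))

airbnb_step <- step(lm(overall_score ~ host_listings_count + price + Breakfast +
                         Pool + Patio + Parking + Gym + Washer +
                         Air_Conditioning + neighborhood,
                       data = airbnb_complete),
                    direction = "backward", trace = 0)

# Adjusted R^2 of the pruned model
summary(airbnb_step)$adj.r.squared
```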

Although it seemed likely that this data was simply not capable of predicting average rating with a useful degree of accuracy, we also tried building a decision tree. To begin, we assigned listing ratings to three categories: “poor” (less than 80), “good” (80 to 90), and “great” (above 90). Although 80 may sound like a low threshold, we felt these categories were appropriate relative to where the distribution of ratings falls in this data set, especially given that the vast majority of ratings lie in the 80-100 range. Unfortunately, the number of predictor variables we wanted to include, combined with how poorly they predicted average score, made a decision tree even less accurate and harder to interpret than the multiple linear regression. The initial setup for a decision tree is shown below, but we chose not to include any tree in our final draft because of their poor quality and interpretability for this research question.

# Create groups
airbnb$rate_cat[airbnb$overall_score < 80] <- "poor"
airbnb$rate_cat[airbnb$overall_score >= 80 & airbnb$overall_score <= 90] <- "good"
airbnb$rate_cat[airbnb$overall_score > 90] <- "great"

# Make rate_cat an ordered factor
airbnb <- airbnb %>% 
  mutate(rate_cat = factor(rate_cat,
                           levels = c("poor", "good", "great"),
                           ordered = T))


# Create split function
holdout_split <- function(df, pred, train_percent = 0.80){

  df_y <- df[, pred]
  df_split <- sample.split(df_y, SplitRatio = train_percent)
  
  # as_tibble() (rather than tibble()) keeps the predictors as ordinary
  # columns instead of wrapping the data.frame into a single df-column
  return(list(train_x = as_tibble(df[df_split, colnames(df) != pred]), 
              train_y = df_y[df_split], 
              test_x = as_tibble(df[!df_split, colnames(df) != pred]), 
              test_y = df_y[!df_split]))
}

# create training and testing datasets
RNGversion('4.0.0')
set.seed(123)
holdout_airbnb <- holdout_split(df = airbnb, 
                                pred = "rate_cat", 
                                train_percent = 0.70)

# Create a data.frame for just the training data:
airbnb_train <- data.frame(holdout_airbnb$train_x,
                           rate_cat = holdout_airbnb$train_y)

# Create a data.frame for the testing data:
airbnb_test <- data.frame(holdout_airbnb$test_x,
                          rate_cat = holdout_airbnb$test_y)


# Check split is correct
bind_rows(train = airbnb_train %>% 
                  dplyr::select(rate_cat) %>% 
                  table() %>% 
                  prop.table() %>% 
                  round(digits = 3),

          test = airbnb_test %>% 
                 dplyr::select(rate_cat) %>% 
                 table() %>% 
                 prop.table() %>% 
                 round(digits = 3),
          .id = "dataset") 
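Although we omitted the trees themselves from the final draft, the kind of tree we experimented with could be fit along these lines. This is a sketch only, and it assumes the rpart package (not loaded above) is installed.

```r
# Classification tree on the training split, using the amenity flags
# plus the two numeric predictors considered earlier
library(rpart)

airbnb_tree <- rpart(rate_cat ~ host_listings_count + price + Breakfast +
                       Pool + Patio + Parking + Gym + Washer + Air_Conditioning,
                     data = airbnb_train, method = "class")

# Accuracy on the held-out test set
mean(predict(airbnb_tree, airbnb_test, type = "class") == airbnb_test$rate_cat,
     na.rm = TRUE)
```

Note that with "great" so dominant, raw accuracy is inflated by majority-class guessing, which is part of why we found the trees uninformative.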

Part 4: Conclusions

Despite our extensive analysis, this data set did not prove conducive to answering our research question. We found multiple statistically significant relationships between average rating and the predictor variables we considered, such as price, neighborhood, and the number of listings a host has. Unfortunately, the magnitude of these effects was generally quite small, which proved to be an issue when building a regression model. Not only were most of the variables fairly weak predictors of rating, but almost all ratings also fell within a very narrow range, leaving little variability for a regression model to explain. We ran into similar issues with a decision tree: many predictors, none of them strong in any combination, made the tree both inaccurate and nearly uninterpretable. We suspect the subjectivity of the reviews is a significant obstacle as well. AirBnB gives essentially no guidance on how to rate listings, and most people seem to give arbitrarily high scores if they do not run into significant problems. This could make ratings hard to predict no matter what data one is given. As much as we would have liked to find stronger results, our biggest conclusion is that the ratings people give AirBnBs are very hard to predict.

Part 5: Limitations / Recommendations

Although our data set provides a wide variety of information about AirBnBs across Boston, there are a number of inherent limitations to both the data and our analysis. First, although one can learn a lot about an AirBnB from its facts and figures, this information by no means tells the full story. For example, the quality of a property is very hard to quantify or reflect in any form of data. Many prospective renters also rely heavily on pictures to judge the quality of an AirBnB; the number of pictures a listing has can be recorded, but how appealing those pictures make a place look is much harder to include in an analysis of this nature. Furthermore, the data lack information on how often these AirBnBs are booked, which to us would play a significant role in defining their success. Our analysis has limitations as well. Ratings are subjective: what one person finds perfectly acceptable, another may not. Our data provide no background on the people leaving the ratings, so it is impossible to say what individual characteristics may affect how they rate a listing. Also, as mentioned previously, this data set only includes observations from Boston in December 2021, so it is not appropriate to generalize these results to the greater US or any region beyond the immediate Boston area. To further explore what makes an AirBnB listing successful, we would recommend data covering a wider region, or data detailing how frequently individual listings are booked. Such data may not be readily available, but that information could help expand on our research question and findings.